12 research outputs found

    Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish

    In morphologically complex languages, many high-level tasks in natural language processing rely on accurate morphosyntactic analyses of the input. However, in light of the risk of error propagation in present-day pipeline architectures for basic linguistic pre-processing, the state of the art for morphosyntactic tagging is still not satisfactory. The main obstacle here is data sparsity inherent to natural language in general and highly inflected languages in particular. In this work, we investigate whether semi-supervised systems may alleviate the data sparsity problem. Our approach uses word clusters obtained from large amounts of unlabelled text in an unsupervised manner in order to provide a supervised probabilistic tagger with morphologically informed features. Our evaluations on a number of datasets for the Polish language suggest that this simple technique improves tagging accuracy, especially with regard to out-of-vocabulary words. This may prove useful to increase cross-domain performance of taggers, and to alleviate the dependency on large amounts of supervised training data, which is especially important from the perspective of less-resourced languages.
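    The cluster-feature idea can be illustrated with a minimal sketch (not the authors' tagger): precomputed word-cluster IDs, e.g. obtained by Brown clustering of unlabelled text, are injected as additional features into an otherwise standard classifier-based tagger. The cluster table, feature set, and toy data below are invented for illustration only.

    ```python
    # Minimal sketch: adding precomputed word-cluster IDs as extra features
    # in a classifier-based tagger. Cluster table and toy data are invented;
    # the paper's actual tagger and feature templates differ.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical cluster IDs, e.g. from Brown clustering of unlabelled text.
    CLUSTERS = {"kota": "0110", "kot": "0110", "widzi": "1011", "ala": "0001"}

    def features(tokens, i):
        word = tokens[i]
        return {
            "word": word.lower(),
            "suffix3": word[-3:],                          # morphologically informative for Polish
            "cluster": CLUSTERS.get(word.lower(), "UNK"),  # cluster feature helps with OOV words
            "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        }

    # Toy training data: (sentence, tag sequence); real data would come from a treebank.
    train = [(["Ala", "widzi", "kota"], ["subst", "fin", "subst"])]
    X = [features(s, i) for s, _ in train for i in range(len(s))]
    y = [t for _, tags in train for t in tags]

    tagger = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    tagger.fit(X, y)
    print(tagger.predict([features(["Ala", "widzi", "kota"], 2)]))
    ```

    Because the cluster lookup does not depend on the labelled training set, unseen words that fall into a known cluster still receive an informative feature, which is the intuition behind the reported OOV gains.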

    TransBank: Metadata as the Missing Link Between NLP and Traditional Translation Studies

    Despite the growing importance of data in translation, there is no data repository that meets the requirements of the translation industry and academia alike. Therefore, we plan to develop a freely available, multilingual and expandable bank of translations and their source texts aligned at the sentence level. Special emphasis will be placed on the labelling of metadata that precisely describe the relations between translated texts and their originals. This metadata-centric approach gives users the opportunity to compile and download custom corpora on demand. Such a general-purpose data repository may help to bridge the gap between translation theory and the language industry, including translation technology providers and NLP.
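    As a rough sketch of the metadata-centric approach (not TransBank's actual schema or API), the example below shows how translation units carrying rich metadata could be filtered into a custom parallel corpus on demand; all field names and records are invented.

    ```python
    # Illustrative only: metadata-driven compilation of a custom parallel corpus.
    # Field names and example records are invented; TransBank's schema may differ.
    from dataclasses import dataclass

    @dataclass
    class TranslationUnit:
        source: str
        target: str
        src_lang: str
        tgt_lang: str
        translator: str   # e.g. "human" vs "machine"
        domain: str

    BANK = [
        TranslationUnit("Guten Tag.", "Good day.", "de", "en", "human", "news"),
        TranslationUnit("Hallo Welt.", "Hello world.", "de", "en", "machine", "tech"),
    ]

    def compile_corpus(bank, **criteria):
        """Return all units whose metadata match every given criterion."""
        return [u for u in bank if all(getattr(u, k) == v for k, v in criteria.items())]

    # A user request such as "German-to-English human translations" becomes a filter:
    for unit in compile_corpus(BANK, src_lang="de", translator="human"):
        print(unit.source, "\t", unit.target)
    ```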

    Online Versus Offline NMT Quality: An In-depth Analysis on English-German and German-English

    In this work, we conduct an evaluation study comparing offline and online neural machine translation architectures. Two sequence-to-sequence models are considered: the convolutional Pervasive Attention model (Elbayad et al., 2018) and the attention-based Transformer (Vaswani et al., 2017). For both architectures, we investigate the impact of online decoding constraints on translation quality through a carefully designed human evaluation on the English-German and German-English language pairs, the latter being particularly sensitive to latency constraints. The evaluation results allow us to identify the strengths and shortcomings of each model when we shift to the online setup.
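    One widely used instantiation of such online decoding constraints is a wait-k read/write policy, in which the decoder starts emitting target tokens after reading only k source tokens. The sketch below simulates that schedule with a stand-in for a real incremental decoder; it is an illustration, not the paper's exact setup.

    ```python
    # Illustrative wait-k read/write schedule (toy assumption: 1:1 length ratio).
    # The t-th target token is produced after reading at most t + k - 1 source tokens,
    # whereas an offline decoder would read the whole source before writing anything.
    def wait_k_schedule(source_tokens, k=3):
        """Yield (action, payload) pairs for a wait-k policy."""
        read, written = 0, 0
        while written < len(source_tokens):
            if read < min(written + k, len(source_tokens)):
                read += 1
                yield ("READ", source_tokens[read - 1])
            else:
                written += 1
                yield ("WRITE", f"<target token {written} given {read} source tokens>")

    for action, payload in wait_k_schedule("Morgen fliege ich nach Kanada".split(), k=2):
        print(f"{action:5s} {payload}")
    ```

    The schedule makes the latency issue for German-English concrete: clause-final German verbs may not yet have been read when the corresponding English tokens must be written.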

    Optimising the Europarl corpus for translation studies with the EuroparlExtract toolkit

    The freely available European Parliament Proceedings Parallel Corpus, or Europarl, is one of the largest multilingual corpora available to date. Surprisingly, bibliometric analyses show that it has hardly been used in translation studies. Its low impact in translation studies may partly be attributed to the fact that the Europarl corpus is distributed in a format that largely disregards the needs of translation research. In order to make the wealth of linguistic data from Europarl easily and readily available to the translation studies community, the toolkit EuroparlExtract has been developed. With the toolkit, comparable and parallel corpora tailored to the requirements of translation research can be extracted from Europarl on demand. Both the toolkit and the extracted corpora are distributed under open licenses. This free availability is intended to avoid duplication of effort in corpus-based translation studies and to ensure the sustainability of data reuse. Thus, EuroparlExtract is a contribution towards satisfying the growing demand for translation-oriented corpora.
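    As a rough illustration of the underlying extraction idea (not the toolkit's actual API or command-line interface), the sketch below filters sentence-aligned Europarl data down to a directional corpus using hypothetical per-sentence original-language labels; the file names and label format are assumptions.

    ```python
    # Generic sketch of directional extraction from sentence-aligned Europarl data.
    # Paths and the per-sentence language-label file are hypothetical; the actual
    # EuroparlExtract toolkit works on the distributed Europarl source files.
    def extract_directional(src_path, tgt_path, lang_path, original_lang="de"):
        """Yield (source, target) pairs whose source side was originally uttered in original_lang."""
        with open(src_path, encoding="utf-8") as src, \
             open(tgt_path, encoding="utf-8") as tgt, \
             open(lang_path, encoding="utf-8") as langs:
            for s, t, lang in zip(src, tgt, langs):
                if lang.strip().lower() == original_lang:   # drop relay/indirect translations
                    yield s.strip(), t.strip()

    # Hypothetical usage:
    # for src_sent, tgt_sent in extract_directional("europarl.de", "europarl.en", "europarl.lang"):
    #     print(src_sent, "\t", tgt_sent)
    ```

    Keeping only pairs whose source side is in its original language is what makes the resulting corpus directional, which matters for translation research on, e.g., translationese.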

    EuroparlExtract - Directional Parallel Corpora Extracted from the European Parliament Proceedings Parallel Corpus

    This dataset contains directional parallel corpora extracted from the European Parliament Proceedings Parallel Corpus (Europarl) v7 created by Philipp Koehn (see http://www.statmt.org/europarl/). For the extraction, the EuroparlExtract corpus processing toolkit by Michael Ustaszewski (2017) was used. EuroparlExtract is freely available under the MIT License (see https://github.com/mustaszewski/europarl-extract).

    Syntactic complexity as a stylistic feature of subtitles

    In audiovisual translation, stylometry can be used to measure formal-aesthetic fidelity. We present a corpus-based measure of syntactic complexity as a feature of language style. The methodology considers hierarchical dimensions of syntactic complexity, using syllable counting and dependency parsing. The test material consists of dialogues of several characters from the TV show “Two and a Half Men”. The results show that the characters do not differ syntactically among themselves as much as might be expected and that, despite a general tendency to level differences even more in translation, changes in syntactic complexity between original and translation depend mostly on the respective character-feature combination.
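    As a rough illustration of such a measure (not the paper's exact methodology), the sketch below combines a crude syllable count with dependency-tree depth obtained from spaCy; the model name and the specific statistics are assumptions made for the example.

    ```python
    # Illustrative syntactic-complexity measure: mean dependency-tree depth plus
    # mean syllables per word. Assumes en_core_web_sm is installed
    # (python -m spacy download en_core_web_sm).
    import re
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def syllables(word):
        """Rough syllable estimate: count groups of consecutive vowels."""
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def depth(token):
        """Depth of the dependency subtree rooted at `token`."""
        return 1 + max((depth(c) for c in token.children), default=0)

    def complexity(text):
        doc = nlp(text)
        sents = list(doc.sents)
        words = [t for t in doc if t.is_alpha]
        return {
            "mean_tree_depth": sum(depth(s.root) for s in sents) / len(sents),
            "mean_syllables_per_word": sum(syllables(t.text) for t in words) / len(words),
        }

    print(complexity("I think that what you said yesterday was, frankly, rather hard to believe."))
    ```

    Computed per character over all of that character's lines, measures of this kind can then be compared between the original dialogues and their subtitled translations.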
